The aim of the competition is to develop a computational model that predicts which molecules will block the malaria parasite's ion pump, PfATP4.
Submitted by James McCulloch - james.duncan.mcculloch@gmail.com
Each predictive model based on fingerprints or another SMILES-based descriptor vector, such as DRAGON, brings a certain amount of predictive power to the task of assessing likely molecular activity against PfATP4.
What the meta model does is combine the predictive power of each model in an optimal way to produce a more predictive composite model.
It does this by taking as its input the probability maps (the outputs) of other classifiers.
The two models chosen as inputs to the meta model are:
A Neural Network model that uses the DRAGON molecular descriptor to estimate molecular PfATP4 ion activity directly. This model had modest predictive power of AUC=0.77. See the first notebook for details.
A logistic classifier that uses the Morgan fingerprints (mol radius = 5) to predict the EC50 <= 500 nMol class. This model was discussed in notebook 2 and has a predictive power of AUC=0.93 for the test molecules. Crucially, this predictive power is for EC50 only, not PfATP4 ion activity. For the test set, EC50 and PfATP4 ion activity are closely correlated because these molecules have similar structures and were designed to be active against PfATP4. However, other molecules from the training set with different structures have different sites of activity and membership of the EC50 <= 500 nMol class is not predictive of PfATP4 ion activity.
A DNN and a variety of scikit-learn classifiers were trained as meta models against the probability maps of the two models described above, and the resulting Area Under Curve (AUC) statistics against the test molecules are tabulated below. Note that the meta model is a binary classifier [ACTIVE, INACTIVE] for ion activity; it does not attempt to classify molecules as [PARTIAL].
In [8]:
from IPython.display import display
import pandas as pd
print("Meta Results")
meta_results = pd.read_csv("./meta_results.csv")
display(meta_results)
Where the META MODELs are as follows:
DNN - A Deep Neural Network classifier [16, 32, 32, 16, 2] from the Keras toolkit. Cross-entropy loss function.
NBC - A Naive Bayes Classifier
SVMC - A support vector machine classifier.
LOGC - A Logistic classifier.
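The stacking idea behind the scikit-learn meta models can be sketched as follows. This is a minimal illustration, not the OSM-QSAR implementation: the two input columns are synthetic stand-ins for the DRAGON and Morgan probability maps, the helper `synthetic_probability_maps` is hypothetical, and the classifier settings are scikit-learn defaults rather than the settings used for the tabulated results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

rng = np.random.RandomState(0)


def synthetic_probability_maps(n_molecules):
    """Hypothetical stand-in for the [ACTIVE] outputs of the two base models."""
    y = rng.randint(0, 2, size=n_molecules)  # 1 = ACTIVE, 0 = INACTIVE
    # Each column is a noisy score correlated with the label, clipped to [0, 1].
    weak = np.clip(0.5 + 0.2 * (y - 0.5) + rng.normal(0, 0.25, n_molecules), 0, 1)
    strong = np.clip(0.5 + 0.4 * (y - 0.5) + rng.normal(0, 0.25, n_molecules), 0, 1)
    return np.column_stack([weak, strong]), y


# The meta model's features are the base classifiers' probabilities,
# not the molecular descriptors themselves.
X_train, y_train = synthetic_probability_maps(400)
X_test, y_test = synthetic_probability_maps(200)

meta_models = {
    "NBC": GaussianNB(),
    "SVMC": SVC(probability=True, random_state=0),
    "LOGC": LogisticRegression(),
}

meta_auc = {}
for name, model in meta_models.items():
    model.fit(X_train, y_train)
    # Column 1 of predict_proba is the probability of class 1 (ACTIVE).
    active_prob = model.predict_proba(X_test)[:, 1]
    meta_auc[name] = roc_auc_score(y_test, active_prob)
    print("{}: AUC = {:.3f}".format(name, meta_auc[name]))
```

Because the meta model sees only two probability columns, it is cheap to train, and comparing its AUC with the AUC of each input column alone shows whether combining the base models adds anything.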
The Meta Models run on Linux and Windows under Python 2.7 and 3.5 (Mac untested):
Download the entire directory tree from GitHub here: https://github.com/kellerberrin/OSM-QSAR and follow the readme setup. Detailed instructions will also be posted so that the withheld molecules can be tested against the optimal model with minimum hassle. The pre-trained DRAGON classifier "ION_DRAGON_625.krs" must be in the model directories. In addition, for the "osm" model, the pre-trained meta model "ION_META_40.krs" should be in the "osm" model directory. The software should give sensible error messages if they are missing.
Make sure you have set up and activated the Python Anaconda environment as described in "readme.md".
For the optimal OSM meta model, the following command was used (see --help for flag descriptions; the optional clean flag removes previous results from the model directory):
$ python OSM_QSAR.py --classify osm --load ION_META --epoch 40 --train 0 [--clean]
For the SVMC scikit-learn meta model, the following command was used (see --help for flag descriptions; the optional clean flag removes previous results from the model directory):
$ python OSM_QSAR.py --classify osm_sk [--clean]
In [5]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import minmax_scale

# train_results and all_active are expected to be defined in an earlier cell.

def sort_map(column):
    # Scale the probability map to [0, 1] and sort it in descending order.
    array = minmax_scale(train_results[column])
    return array[np.argsort(-array)]

scale = 1.0
fig = plt.figure(num=None, figsize=(8 * scale, 6 * scale), dpi=80, facecolor='w', edgecolor='k')
for prob_map in all_active:
    plt.plot(sort_map(prob_map), label=prob_map)
plt.xlabel("molecules")
plt.ylabel("normalized probability")
plt.title("Training Set [ACTIVE] Probability Maps")
plt.legend(loc=1);  # upper right corner
In [6]:
def mol_label_list(data_frame):  # Function to produce rdkit mols and associated molecular labels
    ids = data_frame["ID"].tolist()
    klass = data_frame["ACTUAL_500"].tolist()
    potency = data_frame["EC50"].tolist()
    ion_activity = data_frame["ION_ACTIVITY"].tolist()
    map_prob = data_frame["M5_500_250"].tolist()
    labels = []
    for idx in range(len(ids)):
        # Label: rank, ID, first letter of each class, EC50 * 1000, map probability.
        labels.append("{} {} {} {} {:5.0f} ({:5.4f})".format(idx + 1, ids[idx],
                                                              klass[idx][0], ion_activity[idx][0],
                                                              potency[idx] * 1000, map_prob[idx]))
    smiles = data_frame["SMILE"].tolist()
    mols = [Chem.MolFromSmiles(smile) for smile in smiles]
    return mols, labels
In [7]:
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
IPythonConsole.ipython_useSVG=True
In [8]:
ion_active = ec50_200_active.loc[train_results["ION_ACTIVITY"] == "ACTIVE"].sort_values("EC50")
mols, labels = mol_label_list(ion_active)
Draw.MolsToGridImage(mols,legends=labels,molsPerRow=4)
Out[8]:
In [11]:
# Renamed from `sorted` to avoid shadowing the Python builtin.
sorted_results = test_results.sort_values("M5_500_250", ascending=False)
mols, labels = mol_label_list(sorted_results)
Draw.MolsToGridImage(mols,legends=labels,molsPerRow=4)
Out[11]: